We propose a simple way of testing whether a given set of observations
can come from a given theoretical cumulative distribution. In the test 
more weight is attached to the tails of the distribution than in the usual 
Kolmogorov or Smirnov tests. The respective probability distribution is 
derived. 

1 Introduction 

In mathematical statistics it is extremely important to test whether a given
sample can come from a theoretical cumulative distribution function (CDF)
and to provide quantitative measures for this hypothesis. The most celebrated
tests of this type are the Kolmogorov [1] and Smirnov [2] tests, based on the
supremum of the absolute distance between the observational and theoretical
CDFs, or between two observational CDFs. For example, in the Smirnov test we
have two samples with $N_A$ and $N_B$ points respectively and the
observational CDFs $F_A(x)$ and $F_B(x)$. Under the assumption that they come
from the same CDF, the probability for the weighted supremum $D$ of the
absolute distance between the two,

$$ D := \sqrt{\frac{N_A N_B}{N_A + N_B}}\; \sup_x |F_A(x) - F_B(x)|, $$

is given asymptotically by

$$ P(D > \lambda) = 2 \sum_{n=1}^{\infty} (-1)^{n-1} e^{-2 n^2 \lambda^2} = 1 - \frac{\sqrt{2\pi}}{\lambda} \sum_{n=1}^{\infty} e^{-(2n-1)^2 \pi^2 / (8 \lambda^2)} \qquad (1) $$

When $N_B \to \infty$, i.e. the second distribution is treated as a
theoretical one, we recover the Kolmogorov test. Practical usage of these
tests shows, however, that they are less sensitive to the tails of the
distribution than to the bulk. In practice it is often the case that we
observe exceptionally high or low values which seem to indicate that they
come from a different distribution, but the difference does not translate
into a significant change of the confidence level in the Kolmogorov or
Smirnov tests. There are tests that are more sensitive to the tails of the
distribution but they usually assume normality of the distribution. We would
like to propose a test which emphasizes the role of the tails of the
distribution (separately for the high and low ends), is independent of the
normality assumption, is simple to apply and has explicitly calculable
statistical properties. The motivation for the introduction of the test comes
from the question whether there are statistically significant patterns on the
maps of the background radiation as measured by the WMAP satellite; the
description of the problem and the application of the test will be presented
in a separate publication [3].

2 The proposal 

Assume that the theoretical CDF is given by $F(x)$ and the ordered sample
(with $n$ entries) gives the experimental CDF $F_n(x)$ (which is a step
function increasing by $1/n$ at the points $x_i$). The proposed (right) test
is based on the following quantity

$$ A^R_n := -\alpha \int (1 - F_n(x))\, d \ln\left(1 - (F(x))^\alpha\right) = -\frac{\alpha}{n} \sum_{i=1}^{n} \ln\left(1 - (F(x_i))^\alpha\right) \qquad (2) $$

(and the analogous left test $A^L_n$, obtained with $F \to (1 - F)$), where
$\alpha$ is a positive real number. It is clear that the right test gives
more weight to values of $F$ close to 1 while the left test gives more weight
to values close to 0. With increasing $\alpha$ we increase the relative
weight of the tails of the distribution (right for $A^R$ and left for $A^L$).
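As a minimal sketch of how this quantity can be evaluated for a concrete sample, the Python function below computes $-(\alpha/n) \sum_i \ln(1 - u_i^\alpha)$ with $u_i = F(x_i)$ for the right test and $u_i = 1 - F(x_i)$ for the left test. The names `tail_statistic`, `uniform_cdf` and the `side` argument are illustrative choices and are not part of the original text; the uniform CDF is used only as an example.

```python
import math

def tail_statistic(sample, cdf, alpha=1.0, side="right"):
    """Tail-weighted statistic A = -(alpha/n) * sum_i ln(1 - u_i**alpha).

    u_i = F(x_i) for the right test and u_i = 1 - F(x_i) for the left test.
    Assumes 0 <= u_i < 1 for every sample point (hypothetical helper).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if side not in ("right", "left"):
        raise ValueError("side must be 'right' or 'left'")
    n = len(sample)
    total = 0.0
    for x in sample:
        u = cdf(x)
        if side == "left":
            u = 1.0 - u
        # log1p(-v) evaluates ln(1 - v) accurately when v is small
        total += math.log1p(-(u ** alpha))
    return -alpha * total / n

def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1], used only as an example."""
    return min(max(x, 0.0), 1.0)

sample = [0.10, 0.35, 0.50, 0.80, 0.95]
a_right = tail_statistic(sample, uniform_cdf, alpha=2.0, side="right")
a_left = tail_statistic(sample, uniform_cdf, alpha=2.0, side="left")
print(a_right, a_left)
```

In this example the sample has a point close to the upper end of the support, so the right statistic comes out larger than the left one.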

The test can also be used with some "coarse-graining": we may group the total
number of observations $N$ into $n$ ($1 < n < N$) bins with positions $x_i$
and $d_i$ points in the $i$-th bin and use the formula

$$ A = -\frac{\alpha}{N} \sum_{i=1}^{n} d_i \ln\left(1 - (F(x_i))^\alpha\right) \qquad (3) $$
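The coarse-grained version can be sketched in the same way. The function below assumes the normalisation $A = -(\alpha/N) \sum_i d_i \ln(1 - F(x_i)^\alpha)$ with $N = \sum_i d_i$; the names `binned_tail_statistic` and `exp_cdf`, and the bin positions and counts, are hypothetical and serve only as an illustration.

```python
import math

def binned_tail_statistic(positions, counts, cdf, alpha=1.0):
    """Coarse-grained statistic A = -(alpha/N) * sum_i d_i * ln(1 - F(x_i)**alpha)."""
    if len(positions) != len(counts):
        raise ValueError("positions and counts must have the same length")
    total_points = sum(counts)  # N, the total number of observations
    acc = 0.0
    for x, d in zip(positions, counts):
        acc += d * math.log1p(-(cdf(x) ** alpha))
    return -alpha * acc / total_points

def exp_cdf(x):
    """CDF of the unit exponential distribution, used only as an example."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

# Three bins (positions are bin representatives) holding 6, 3 and 1 points
a_binned = binned_tail_statistic([0.5, 1.5, 2.5], [6, 3, 1], exp_cdf, alpha=2.0)
print(a_binned)
```

For the left test one would pass the complementary function $1 - F(x)$ in place of `cdf`.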






3 Properties of the test 

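To derive the distribution function for $A^R_n$ (the same for $A^L_n$) we
first calculate the characteristic function (we denote by $A$ either $A^R_n$
or $A^L_n$)

$$ \langle e^{itA} \rangle = \left[ \int_0^1 dz\, (1 - z^\alpha)^{-it\alpha/n} \right]^n \qquad (4) $$

It is straightforward to calculate this quantity with the result

$$ \langle e^{itA} \rangle = \left[ \frac{\Gamma(1 + 1/\alpha)\, \Gamma(1 - it\alpha/n)}{\Gamma(1 + 1/\alpha - it\alpha/n)} \right]^n \qquad (5) $$

Defining cumulants of the distribution as

$$ \ln \langle e^{itA} \rangle = \sum_{k=1}^{\infty} \frac{(it)^k}{k!}\, \kappa_k(A) \qquad (6) $$

so that for example

$$ \kappa_1(A) = \langle A \rangle, $$

$$ \kappa_2(A) = \langle A^2 \rangle - \langle A \rangle^2, $$

$$ \kappa_3(A) = \langle A^3 \rangle - 3 \langle A^2 \rangle \langle A \rangle + 2 \langle A \rangle^3, $$

and using the properties of the $\Gamma$ function we get the general
expression for the cumulants $\kappa_k(A)$ for the distribution (5):

$$ \kappa_k(A) = \alpha\, (k-1)! \left( \frac{\alpha}{n} \right)^{k-1} \sum_{j=1}^{\infty} \left( \frac{1}{j^k} - \frac{1}{(j + 1/\alpha)^k} \right) \qquad (7) $$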
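The cumulant series can be checked numerically. The sketch below evaluates $\alpha\,(k-1)!\,(\alpha/n)^{k-1} \sum_j [j^{-k} - (j + 1/\alpha)^{-k}]$ by simply truncating the sum; the function name `cumulant` and the truncation length `terms` are assumptions made for this illustration. For $\alpha = 1$ the sum telescopes to 1, which gives the cumulants $(k-1)!\, n^{1-k}$.

```python
import math

def cumulant(k, n, alpha, terms=200_000):
    """k-th cumulant: alpha*(k-1)!*(alpha/n)**(k-1) * sum_j [j**-k - (j+1/alpha)**-k].

    The infinite sum is truncated after `terms` terms (an illustrative choice);
    the resulting error in the k = 1 cumulant is about 1/terms and it is
    smaller for k > 1.
    """
    shift = 1.0 / alpha
    series = 0.0
    for j in range(1, terms + 1):
        series += j ** (-k) - (j + shift) ** (-k)
    return alpha * math.factorial(k - 1) * (alpha / n) ** (k - 1) * series

# alpha = 1, n = 10: mean close to 1, variance close to 1/n
print(cumulant(1, 10, 1.0), cumulant(2, 10, 1.0))
# A tail-weighted case, alpha = 4, n = 10
print(cumulant(1, 10, 4.0), cumulant(2, 10, 4.0))
```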
The probability distribution for $A$ is given by:

$$ g_\alpha(n; s) = \frac{1}{2\pi} \int_{-\infty}^{\infty} dt\, e^{-ist}\, \langle e^{itA} \rangle \qquad (8) $$

The integral gives $0$ if $s < 0$, which is consistent with the definition of
$A$; in the formulae below we assume henceforth $s > 0$. For general $n$ and
$\alpha$ it is straightforward to calculate $g_\alpha(n; s)$ numerically with
arbitrary accuracy.
In the case ($\alpha = 1$, $n$ arbitrary) it is possible to perform the
integral in (8) analytically and we get

$$ g_1(n; s) = \frac{n^n}{(n-1)!}\, s^{n-1} e^{-ns} \qquad (9) $$

i.e. the gamma distribution with $\langle s \rangle = 1$ and the variance
$1/n$.
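For $\alpha = 1$ the cumulative probability therefore has a closed form. The sketch below uses the standard finite-sum expression for the gamma CDF with integer shape $n$ and rate $n$, $1 - e^{-na} \sum_{j<n} (na)^j / j!$; the function names and the idea of reporting the upper-tail probability are illustrative assumptions, not something stated in the text.

```python
import math

def gamma_cdf_alpha1(a, n):
    """P(A < a) for alpha = 1: gamma law with shape n and rate n.

    Uses the finite-sum form of the regularised incomplete gamma function,
    valid for positive integer n: 1 - exp(-n a) * sum_{j<n} (n a)**j / j!.
    Intended for moderate n; very large n*a may overflow the partial sum.
    """
    if a <= 0:
        return 0.0
    x = n * a
    term = 1.0      # (n a)**j / j!, starting at j = 0
    partial = 1.0
    for j in range(1, n):
        term *= x / j
        partial += term
    return 1.0 - math.exp(-x) * partial

def upper_tail_alpha1(a_observed, n):
    """Probability of a value of A at least as large as the observed one."""
    return 1.0 - gamma_cdf_alpha1(a_observed, n)

print(gamma_cdf_alpha1(1.0, 5), upper_tail_alpha1(2.0, 5))
```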

Since the distribution (8) is given as a Fourier transform it is
straightforward to give also the expression for the cumulative distribution,
i.e. the probability to get a value of $A$ smaller than $a$:

$$ G_\alpha(n; a) := \int_0^a ds\, g_\alpha(n; s) = \frac{1}{2\pi} \int_{-\infty}^{\infty} dt\, \frac{1 - e^{-iat}}{it}\, \langle e^{itA} \rangle \qquad (10) $$

It is straightforward to calculate $G_\alpha(n; a)$ numerically with
arbitrary accuracy.
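A minimal sketch of such a numerical inversion is given below, restricted to $\alpha = 1$, where the characteristic function reduces to the elementary form $(1 - it/n)^{-n}$ and no complex gamma function is needed. Because the integrand at $-t$ is the complex conjugate of the integrand at $t$, the integral over the real line is rewritten as $(1/\pi)\, \Re \int_0^\infty$. The midpoint rule, the cutoff `t_max` and the number of steps are assumptions chosen for illustration; small $n$ would require a larger cutoff because the characteristic function then decays slowly.

```python
import cmath
import math

def char_fn_alpha1(t, n):
    """Characteristic function <exp(itA)> for alpha = 1: (1 - i t / n)**(-n)."""
    return (1.0 - 1j * t / n) ** (-n)

def cdf_by_inversion(a, n, t_max=200.0, steps=200_000):
    """Approximate G(a) = (1/pi) * Re int_0^inf dt (1 - e^{-iat})/(it) * phi(t).

    Midpoint rule on [0, t_max]; t_max and steps are illustrative choices.
    """
    h = t_max / steps
    total = 0.0
    for j in range(steps):
        t = (j + 0.5) * h  # midpoints avoid the removable singularity at t = 0
        kernel = (1.0 - cmath.exp(-1j * a * t)) / (1j * t)
        total += (kernel * char_fn_alpha1(t, n)).real
    return total * h / math.pi

def gamma_cdf_exact(a, n):
    """Closed-form gamma CDF (shape n, rate n) used as a cross-check."""
    x = n * a
    return 1.0 - math.exp(-x) * sum(x ** j / math.factorial(j) for j in range(n))

numeric = cdf_by_inversion(1.0, 5)
exact = gamma_cdf_exact(1.0, 5)
print(numeric, exact)
```

The two printed numbers should agree closely, which gives a simple consistency check between the Fourier representation and the gamma distribution obtained for $\alpha = 1$.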



4 The case of large $\alpha$ and $n$

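We define

$$ \sigma := \frac{\alpha}{n} $$

and below we discuss the case $\alpha \to \infty$, $n \to \infty$ with
$\sigma$ kept finite.

To derive the limiting distribution in this case we expand the $\Gamma$
functions and we get (up to $1/n$ and $1/\alpha$ corrections)

$$ \langle e^{itA} \rangle = \exp\left( -\frac{\gamma + \psi(1 - it\sigma)}{\sigma} \right) \qquad (11) $$

and the distribution function

$$ g_\infty(\sigma; s) = \frac{1}{2\pi} \int_{-\infty}^{\infty} dt\, \exp\left( -ist - \frac{\gamma + \psi(1 - it\sigma)}{\sigma} \right) \qquad (12) $$

where

$$ \psi(1 - z) = -\frac{d}{dz} \ln \Gamma(1 - z) = -\gamma - \sum_{k=1}^{\infty} \zeta(k+1)\, z^k $$

and $\gamma$ is the Euler constant. The integral cannot be calculated in a
closed form but we can give a closed expression for the cumulants of the
distribution using (7)

$$ \kappa_k = k!\, \zeta(k+1)\, \sigma^{k-1} $$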
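The limiting cumulants $k!\, \zeta(k+1)\, \sigma^{k-1}$ are easy to tabulate. The sketch below evaluates the zeta function by a direct sum with an Euler-Maclaurin tail correction; the helper names `zeta` and `limiting_cumulant` and the cutoff are assumptions made for this illustration. In particular the limiting mean is $\zeta(2) = \pi^2/6$ for every $\sigma$, while the variance $2 \zeta(3) \sigma$ grows linearly with $\sigma$.

```python
import math

def zeta(s, cutoff=1000):
    """Riemann zeta for real s > 1: direct sum plus Euler-Maclaurin tail terms."""
    head = sum(j ** (-s) for j in range(1, cutoff))
    # integral of x**(-s) from cutoff to infinity, plus half of the cutoff term
    tail = cutoff ** (1 - s) / (s - 1) + 0.5 * cutoff ** (-s)
    return head + tail

def limiting_cumulant(k, sigma):
    """k-th cumulant for alpha, n -> infinity: k! * zeta(k+1) * sigma**(k-1)."""
    return math.factorial(k) * zeta(k + 1) * sigma ** (k - 1)

for k in (1, 2, 3):
    print(k, limiting_cumulant(k, 1.0))
```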

It is also convenient to give directly the cumulative distribution of
$g_\infty$:

$$ G_\infty(\sigma; a) := \int_0^a ds\, g_\infty(\sigma; s) = \Re \int_0^\infty dt\, \frac{1 - e^{-iat}}{i\pi t} \exp\left( -\frac{\gamma}{\sigma} - \frac{1}{\sigma} \psi(1 - it\sigma) \right) \qquad (13) $$

It is straightforward to calculate this integral numerically for arbitrary
$\sigma$ and $a$ with any prescribed accuracy; for example
$G_\infty(1; 1) = 0.439166...$, $G_\infty(1; 3) = 0.8390636...$,
$G_\infty(1; 7) = 0.9898427...$ and $G_\infty(1; 17) = 0.999995...$.
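A rough independent check of these numbers can be made by simulation rather than by evaluating the Fourier integral. The sketch below draws the statistic $A = -(\alpha/n) \sum_i \ln(1 - u_i^\alpha)$ for uniform $u_i$ at finite $n = \alpha = 100$ (so that $\alpha/n = 1$) and estimates the probability that $A < 1$. This is a Monte Carlo substitute for the integral, with statistical noise and finite-size corrections, so only approximate agreement with the quoted value $0.439166...$ should be expected; the function names, sample sizes and seed are assumptions of this illustration.

```python
import math
import random

def simulate_statistic(n, alpha, rng):
    """One draw of A = -(alpha/n) * sum ln(1 - u**alpha), u uniform on [0, 1)."""
    total = 0.0
    for _ in range(n):
        total += math.log1p(-(rng.random() ** alpha))
    return -alpha * total / n

def monte_carlo_cdf(a, n, alpha, trials=10_000, seed=12345):
    """Fraction of simulated statistics below a (Monte Carlo estimate of the CDF)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if simulate_statistic(n, alpha, rng) < a)
    return hits / trials

# alpha / n = 1 and a = 1, to be compared with the limiting value quoted in the text
estimate = monte_carlo_cdf(a=1.0, n=100, alpha=100.0)
print(estimate)
```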

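For $\sigma \gg 1$ it is useful to separate the part that is slowly decaying
for large $t$ from the rest:

$$
\begin{aligned}
G_\infty(\sigma; a) ={}& \Re \int_0^\infty dt\, \frac{1 - e^{-it}}{i\pi t}
  \exp\left( -\frac{\gamma}{\sigma} - \frac{1}{\sigma} \ln\left(\frac{t}{y}\right) + \frac{i\pi}{2\sigma} \right) \\
 &+ \Re \int_0^\infty dt\, \frac{1 - e^{-it}}{i\pi t}
  \left[ \exp\left( -\frac{\gamma}{\sigma} - \frac{1}{\sigma} \psi\left(1 - \frac{it}{y}\right) \right)
  - \exp\left( -\frac{\gamma}{\sigma} - \frac{1}{\sigma} \ln\left(\frac{t}{y}\right) + \frac{i\pi}{2\sigma} \right) \right]
\end{aligned} \qquad (14)
$$

where $y := a/\sigma$ and the integration variable of (13) has been rescaled,
$t \to t/a$. The first term can be easily integrated and we get

$$
\begin{aligned}
G_\infty(\sigma; a) ={}& \exp\left( \frac{\ln(y) - \gamma}{\sigma} - \ln\left(\Gamma(1 + 1/\sigma)\right) \right) \\
 &+ \Re \int_0^\infty dt\, \frac{1 - e^{-it}}{i\pi t}
  \left[ \exp\left( -\frac{\gamma}{\sigma} - \frac{1}{\sigma} \psi\left(1 - \frac{it}{y}\right) \right)
  - \exp\left( -\frac{\gamma}{\sigma} - \frac{1}{\sigma} \ln\left(\frac{t}{y}\right) + \frac{i\pi}{2\sigma} \right) \right]
\end{aligned} \qquad (15)
$$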



The formula can be expanded in inverse powers of $\sigma$ and all the
integrals are well behaved. It is convenient to organize the series in a
slightly different way (the first factor reflects the leading behavior):

$$ G_\infty(\sigma; a) = \left(1 - e^{-y}\right)^{1/\sigma} \left( 1 + \sum_{k=2}^{\infty} \frac{f_k(y)}{\sigma^k} \right) \qquad (16) $$

where the functions $f_k(y)$ can be read off from (15). It is rather
straightforward to see that $f_1(y) = 0$ and after rather involved
manipulations we get:

/ 2 ( y ) = | In (l (17) 

i=i 



so it is a monotonic function of $y$, rising from a negative value $f_2(0)$
to $f_2(\infty) = 0$. The form (16) is very useful in actual applications [3].



5 Experimental CDF 

The test described in this paper requires the knowledge of the theoretical
CDF, but it is often the case that we have at our disposal only measurements
of the CDF. If we have $k$ such measurements (each with $n$ entries) then we
calculate at each point the average experimental CDF $\mu(x)$, equal to the
average of all measured CDFs at this point. Therefore at every point $\mu$
can take one of the values $0, \frac{1}{kn}, \frac{2}{kn}, \ldots, 1$. It is
straightforward to prove that for the theoretical CDF given by $F(x)$ the
probability to measure at this point the value $\mu$ is given by the binomial
distribution

$$ P(\mu) = \binom{kn}{kn\mu} F^{kn\mu} (1 - F)^{kn(1 - \mu)} $$

For large values of $kn$ it tends to the gaussian distribution

$$ P(\mu) \approx \frac{1}{\sqrt{2\pi kn F(1 - F)}} \exp\left( -\frac{kn (\mu - F)^2}{2 F (1 - F)} \right) $$

so that $\mu$ approximates $F$ with dispersion $\sqrt{F(1 - F)/(kn)}$.
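The quality of the gaussian approximation can be inspected directly. The sketch below compares the binomial probability of observing $\mu = m/(kn)$ with a gaussian of mean $F$ and dispersion $\sqrt{F(1-F)/(kn)}$, evaluated on the lattice of allowed values; the function names and the example values $kn = 200$, $F = 0.3$ are assumptions made for illustration.

```python
import math

def prob_mu_exact(m, kn, F):
    """Binomial probability that mu = m/(kn) when the true CDF value is F."""
    return math.comb(kn, m) * F ** m * (1.0 - F) ** (kn - m)

def prob_mu_gauss(m, kn, F):
    """Gaussian approximation with mean F and dispersion sqrt(F(1-F)/(kn))."""
    mu = m / kn
    var = F * (1.0 - F)
    return math.exp(-kn * (mu - F) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * kn * var)

kn, F = 200, 0.3
for m in (50, 60, 70):
    print(m / kn, prob_mu_exact(m, kn, F), prob_mu_gauss(m, kn, F))

dispersion = math.sqrt(F * (1.0 - F) / kn)
print(dispersion)
```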

6 Conclusions



The test proposed in this paper requires more work to check its power and
usefulness and to compare it with known non-parametric tests, but with its
simplicity and explicit probability distributions it should be useful in
statistical data analysis.